In this paper, word sense disambiguation (WSD) accuracy achievable by a probabilistic
classifier, using very minimal training sets, is investigated. We made the assumption that 
there are no tagged corpora available and identified what information, needed by an 
Jordan

March 2003 The Journal of Machine Learning Research, Volume 3 

Full text available: pdf(789.04 KB) Additional Information: full citation, abstract, citings, index terms

We present a new approach for modeling multi-modal data sets, focusing on the specific 
case of segmented images with associated text. Learning the joint distribution of image 
regions and words has many applications. We consider in detail predicting words associated 
with whole images (auto-annotation) and corresponding to particular image regions (region 
naming). Auto-annotation might help organize and access large collections of images. 
Region naming is a model of object recognition as a process ... 

6 Text Extraction and Summarization: Combining multiple classifiers for text 
categorization 

Khalid Al-Kofahi, Alex Tyrrell, Arun Vachher, Tim Travers, Peter Jackson 
October 2001 Proceedings of the tenth international conference on Information and 
knowledge management

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms

A major problem facing online information services is how to index and supplement large 
document collections with respect to a rich set of categories. We focus upon the routing of 
case law summaries to various secondary law volumes in which they should be cited. Given 
the large number (> 13,000) of closely related categories, this is a challenging task that is
unlikely to succumb to a single algorithmic solution. Our fully implemented and recently 
deployed system shows that a superior classifica ... 

Keywords: document classification, multi-classifier 
7 Dialogue act modeling for automatic tagging and recognition of conversational speech
Andreas Stolcke, Noah Coccaro, Rebecca Bates, Paul Taylor, Carol Van Ess-Dykema, Klaus
Ries, Elizabeth Shriberg, Daniel Jurafsky, Rachel Martin, Marie Meteer
September 2000 Computational Linguistics, volume 26 issue 3 

Full text available: pdf Additional Information: full citation, abstract, references, citings
Publisher Site

We describe a statistical approach for modeling dialogue acts in conversational speech, i.e., 
speech-act-like units such as STATEMENT, QUESTION, BACKCHANNEL, AGREEMENT, 
DISAGREEMENT, and APOLOGY. Our model detects and predicts dialogue acts based on 
lexical, collocational, and prosodic cues, as well as on the discourse coherence of the 
dialogue act sequence. The dialogue model is based on treating the discourse structure of a
conversation as a hidden ... 

8 Improving accuracy in word class tagging through the combination of machine learning 
systems 

Hans van Halteren, Walter Daelemans, Jakub Zavrel 
June 2001 Computational Linguistics, volume 27 issue 2 

Full text available: pdf(2.37 MB) Additional Information: full citation, abstract, references
Publisher Site

We examine how differences in language models, learned by different data-driven systems 
performing the same NLP task, can be exploited to yield a higher accuracy than the best 
individual system. We do this by means of experiments involving the task of
morphosyntactic word class tagging, on the basis of three different tagged corpora. Four 
well-known tagger generators (hidden Markov model, memory-based, transformation rules, 
and maximum entropy) are trained on the same corpus data. After comparis ... 

9 Semantic annotation and integration: Towards the self-annotating web 
Philipp Cimiano, Siegfried Handschuh, Steffen Staab 

May 2004 Proceedings of the 13th international conference on World Wide Web 

Full text available: pdf(374.83 KB) Additional Information: full citation, abstract, references, index terms

The success of the Semantic Web depends on the availability of ontologies as well as on the 
proliferation of web pages annotated with metadata conforming to these ontologies. Thus, a 
crucial question is where to acquire these metadata from. In this paper we propose PANKOW
(Pattern-based Annotation through Knowledge on the Web), a method which employs an
unsupervised, pattern-based approach to categorize instances with regard to an ontology. 
The approach is evaluated against the manual annotations ... 

Keywords: Information extraction, metadata, semantic annotation, semantic web 



10 Hierarchical indexing and document matching in BoW
Maayan Geffet, Dror G. Feitelson 

January 2001 Proceedings of the 1st ACM/IEEE-CS joint conference on Digital libraries 

Full text available: pdf(434.24 KB) Additional Information: full citation, abstract, references, index terms

BoW is an on-line bibliographical repository based on a hierarchical concept index to which
entries are linked. Searching in the repository should therefore return matching topics from 
the hierarchy, rather than just a list of entries. Likewise, when new entries are inserted, a 
search for relevant topics to which they should be linked is required. We develop a vector- 
based algorithm that creates keyword vectors for the set of competing topics at each node 
in the hierarchy, and show how its ... 

11 Search improvement via automatic query reformulation
Susan Gauch, John B. Smith 

July 1991 ACM Transactions on Information Systems (TOIS), volume 9 issue 3 

Full text available: pdf(2.28 MB) Additional Information: full citation, references, citings, index terms, review



Keywords: Expert Systems, full-text information retrieval, online search assistance, query 
reformulation, textbases 



12 Data mining: Mining the peanut gallery: opinion extraction and semantic classification
of product reviews 

Kushal Dave, Steve Lawrence, David M. Pennock 

May 2003 Proceedings of the twelfth international conference on World Wide Web 
Full text available: pdf(327.95 KB) Additional Information: full citation, abstract, references, citings, index terms

The web contains a wealth of product reviews, but sifting through them is a daunting task. 
Ideally, an opinion mining tool would process a set of search results for a given item, 
generating a list of product attributes (quality, features, etc.) and aggregating opinions 
about each of them (poor, mixed, good). We begin by identifying the unique properties of 
this problem and develop a method for automatically distinguishing between positive and 
negative reviews. Our classifier draws on information ... 

Keywords: document classification, opinion mining 



13 Integrating prosodic and lexical cues for automatic topic segmentation 
Gokhan Tur, Andreas Stolcke, Dilek Hakkani-Tur, Elizabeth Shriberg 
March 2001 Computational Linguistics, Volume 27 issue 1

Full text available: pdf Additional Information: full citation, abstract, references
Publisher Site

We present a probabilistic model that uses both prosodic and lexical cues for the automatic 
segmentation of speech into topically coherent units. We propose two methods for 
combining lexical and prosodic information using hidden Markov models and decision trees. 
Lexical information is obtained from a speech recognizer, and prosodic features are 
extracted automatically from speech waveforms. We evaluate our approach on the 
Broadcast News corpus, using the DARPA-TDT evaluation metrics. Results sho ... 

14 Technical best paper contest session: Multi-level annotation of natural scenes using 
dominant image components and semantic concepts

Jianping Fan, Yuli Gao, Hangzai Luo 

October 2004 Proceedings of the 12th annual ACM international conference on 
Multimedia 

Full text available: pdf Additional Information: full citation, abstract, references, index terms

Automatic image annotation is a promising solution to enable semantic image retrieval via 
keywords. In this paper, we propose a multi-level approach to annotate the semantics of 
natural scenes by using both the dominant image components (salient
objects) and the relevant semantic concepts. To achieve automatic image annotation at the 
content level, we use salient objects as the dominant image components for image content 
representation and feature extraction. To ... 

Keywords: adaptive EM algorithm, automatic image annotation, salient objects 



15 A model of multimedia information retrieval
Carlo Meghini, Fabrizio Sebastiani, Umberto Straccia 
September 2001 Journal of the ACM (JACM), Volume 48 issue 5 

Full text available: pdf(5.69 MB) Additional Information: full citation, abstract, references, citings, index terms

Research on multimedia information retrieval (MIR) has recently witnessed a booming
interest. A prominent feature of this research trend is its simultaneous but independent 
materialization within several fields of computer science. The resulting richness of 
paradigms, methods and systems may, on the long run, result in a fragmentation of efforts 
and slow down progress. The primary goal of this study is to promote an integration of 
methods and techniques for MIR by contributing a conceptual model ... 

Keywords: Description logics, fuzzy logics, multimedia information retrieval 
16 Automatic generation of concise summaries of spoken dialogues in unrestricted
domains
Klaus Zechner 

September 2001 Proceedings of the 24th annual international ACM SIGIR conference on
Research and development in information retrieval 

Full text available: pdf(209.67 KB) Additional Information: full citation, abstract, references, index terms

Automatic summarization of open domain spoken dialogues is a new research area. This
paper introduces the task, the challenges involved, and presents an approach to obtain
automatic extract summaries for multi-party dialogues of four different genres, without any 
restriction on domain. We address the following issues which are intrinsic to spoken dialogue 
summarization and typically can be ignored when summarizing written text such as
newswire data: (1) detection and removal of speech disfl ... 



17 Question-answering by predictive annotation
John Prager, Eric Brown, Anni Coden, Dragomir Radev 

July 2000 Proceedings of the 23rd annual international ACM SIGIR conference on 
Research and development in information retrieval 

Full text available: pdf(794.29 KB) Additional Information: full citation, abstract, references, citings, index terms

We present a new technique for question answering called Predictive Annotation. Predictive
Annotation identifies potential answers to questions in text, annotates them accordingly and 
indexes them. This technique, along with a complementary analysis of questions, passage- 
level ranking and answer selection, produces a system effective at answering natural- 
language fact-seeking questions posed against large document collections. Experimental 
results show the effects of different parameter setti ... 



18 RCV1: A New Benchmark Collection for Text Categorization Research
David D. Lewis, Yiming Yang, Tony G. Rose, Fan Li 
August 2004 The Journal of Machine Learning Research, Volume 5 

Full text available: pdf(628.29 KB) Additional Information: full citation, abstract, citings, index terms

Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized
newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this 
data for research on text categorization requires a detailed understanding of the real world 
constraints under which the data was produced. Drawing on interviews with Reuters 
personnel and access to Reuters documentation, we describe the coding policy and quality 
control procedures used in producing the RCV1 data, the inten ...

19 Term clustering of syntactic phrases
D. D. Lewis, W. B. Croft 

December 1989 Proceedings of the 13th annual international ACM SIGIR conference on 
Research and development in information retrieval 

Full text available: pdf(1.62 MB) Additional Information: full citation, abstract, references, citings, index terms

Term clustering and syntactic phrase formation are methods for transforming natural 
language text. Both have had only mixed success as strategies for improving the quality of 
text representations for document retrieval. Since the strengths of these methods are 
complementary, we have explored combining them to produce superior representations. In 
this paper we discuss our implementation of a syntactic phrase generator, as well as our 
preliminary experiments with producing phrase clusters. Th ... 



20 Establishing the semantic web 1: SemTag and seeker: bootstrapping the semantic web
via automated semantic annotation 

Stephen Dill, Nadav Eiron, David Gibson, Daniel Gruhl, R. Guha, Anant Jhingran, Tapas
Kanungo, Sridhar Rajagopalan, Andrew Tomkins, John A. Tomlin, Jason Y. Zien

May 2003 Proceedings of the twelfth international conference on World Wide Web 

Additional Information: full citation, abstract, references, citings, index terms

This paper describes Seeker, a platform for large-scale text analytics, and SemTag, an
application written on the platform to perform automated semantic tagging of large corpora. 
We apply SemTag to a collection of approximately 264 million web pages, and generate 
approximately 434 million automatically disambiguated semantic tags, published to the web 
as a label bureau providing metadata regarding the 434 million annotations. To our 
knowledge, this is the largest scale semantic tagging effort to ... 

Keywords: automated semantic tagging, data mining, information retrieval, large text 
datasets, text analytics 
21 Computer Evaluation of Indexing and Text Processing
G. Salton, M. E. Lesk 

January 1968 Journal of the ACM (JACM), volume 15 issue 1

Full text available: pdf(2.19 MB) Additional Information: full citation, abstract, references, citings, index terms



Automatic indexing methods are evaluated and design criteria for modern information
systems are derived. 

22 Modelling experience: Computational models for experiences in the arts, and
multimedia 

Harini Sridharan, Hari Sundaram, Thanassis Rikakis

November 2003 Proceedings of the 2003 ACM SIGMM workshop on Experiential 
telepresence 

Full text available: pdf(601.92 KB) Additional Information: full citation, abstract, references, index terms

In this paper, we develop formal computational models for three aspects of experiential 
systems for browsing media: (a) context (b) interactivity through hyper-mediation and (c)
context evolution using a memory model. Experiential systems deal with the problem of 
developing context adaptive mechanisms for knowledge acquisition and insight. Context is 
modeled as a union of graphs whose nodes represent concepts and where the edges 
represent the semantic relationships. The system context is the u ... 

Keywords: art, context, experiential models, hyper-mediation, implications, semantic nets,
WordNet



23 Clustering: Probabilistic combination of text classifiers using reliability indicators: 
models and results 

Paul N. Bennett, Susan T. Dumais, Eric Horvitz

August 2002 Proceedings of the 25th annual international ACM SIGIR conference on 
Research and development in information retrieval 

Full text available: pdf(126.99 KB) Additional Information: full citation, abstract, references, index terms

The intuition that different text classifiers behave in qualitatively different ways has long
motivated attempts to build a better metaclassifier via some combination of classifiers. We 
introduce a probabilistic method for combining classifiers that considers the context- 
sensitive reliabilities of contributing classifiers. The method harnesses reliability indicators — 
variables that provide a valuable signal about the performance of classifiers in different
situations. We provide backgrou ... 
24 Automatic text categorization in terms of genre and author

Efstathios Stamatatos, George Kokkinakis, Nikos Fakotakis 
December 2000 Computational Linguistics, volume 26 issue 4 

Full text available: pdf(1.30 MB) Additional Information: full citation, abstract, references, citings
Publisher Site

The two main factors that characterize a text are its content and its style, and both can be 
used as a means of categorization. In this paper we present an approach to text 
categorization in terms of genre and author for Modern Greek. In contrast to previous
stylometric approaches, we attempt to take full advantage of existing natural language
processing (NLP) tools. To this end, we propose a set of style markers including analysis- 
level measures that represent the way in which the input text ha ... 

25 Automatically extracting highlights for TV Baseball programs 

Yong Rui, Anoop Gupta, Alex Acero 

October 2000 Proceedings of the eighth ACM international conference on Multimedia 

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms

In today's fast-paced world, while the number of channels of television programming
available is increasing rapidly, the time available to watch them remains the same or is 
decreasing. Users desire the capability to watch the programs time-shifted (on-demand) 
and/or to watch just the highlights to save time. In this paper we explore how to provide for 
the latter capability, that is the ability to extract highlights automatically, so that viewing 
time can be reduced. 

We focus on the sp ... 

Keywords: audio, baseball, highlights, summarization, television, video 



26 Multimodal communication: Multimodal model integration for sentence unit detection
Mary P. Harper, Elizabeth Shriberg 

October 2004 Proceedings of the 6th international conference on Multimodal interfaces 

Full text available: pdf(469.02 KB) Additional Information: full citation, abstract, references, index terms

In this paper, we adopt a direct modeling approach to utilize conversational gesture cues in 
detecting sentence boundaries, called SUs, in video taped conversations. We treat the 
detection of SUs as a classification task such that for each inter-word boundary, the 
classifier decides whether there is an SU boundary or not. In addition to gesture cues, we 
also utilize prosody and lexical knowledge sources. In this first investigation, we find that 
gesture features complement the prosodic and le ... 

Keywords: dialog, gesture, language models, multimodal fusion, prosody, sentence 
boundary detection 



27 Text categorization: Text categorization by boosting automatically extracted concepts
Lijuan Cai, Thomas Hofmann 

July 2003 Proceedings of the 26th annual international ACM SIGIR conference on 
Research and development in information retrieval

Full text available: pdf(237.53 KB) Additional Information: full citation, abstract, references, index terms

Term-based representations of documents have found wide-spread use in information 
retrieval. However, one of the main shortcomings of such methods is that they largely
disregard lexical semantics and, as a consequence, are not sufficiently robust with respect to 
variations in word usage. In this paper we investigate the use of concept-based document 
representations to supplement word- or phrase-based features. The utilized concepts are
automatically extracted from documents via probabilistic late ... 

Keywords: boosting, classification, concept extraction, document categorization, lexical 
semantics, machine learning 

28 Special issue on word sense disambiguation: Using corpus statistics and WordNet
relations for sense identification

Claudia Leacock, George A. Miller, Martin Chodorow 

March 1998 Computational Linguistics, Volume 24 issue 1

Full text available: pdf(1.35 MB) Additional Information: full citation, abstract, references, citings
Publisher Site

Corpus-based approaches to word sense identification have flexibility and generality but 
suffer from a knowledge acquisition bottleneck. We show how knowledge-based techniques 
can be used to open the bottleneck by automatically locating training corpora. We describe a 
statistical classifier that combines topical context with local cues to identify a word sense. 
The classifier is used to disambiguate a noun, a verb, and an adjective. A knowledge base in 
the form of WordNet's lexical relations is ...

29 From reading to retrieval: freeform ink annotations as queries
Gene Golovchinsky, Morgan N. Price, Bill N. Schilit

August 1999 Proceedings of the 22nd annual international ACM SIGIR conference on 
Research and development in information retrieval 

Full text available: pdf(236.86 KB) Additional Information: full citation, references, citings, index terms



Keywords: annotation-based queries, digital libraries, empirical evaluation, freeform digital 
ink, information appliances, information exploration, information retrieval, query expansion, 
query-mediated browsing, relevance feedback, user studies 



30 Special issue on word sense disambiguation: Introduction to the special issue on word 
sense disambiguation: the state of the art 
Nancy Ide, Jean Veronis 

March 1998 Computational Linguistics, volume 24 issue 1

Full text available: pdf(3.44 MB) Additional Information: full citation, references, citings
Publisher Site



31 Finding factors: learning to classify case opinions under abstract fact categories
Stefanie Bruninghaus, Kevin D. Ashley 

June 1997 Proceedings of the sixth international conference on Artificial intelligence
and law 

Full text available: pdf(1.27 MB) Additional Information: full citation, references, citings, index terms



32 Local Feedback in Full-Text Retrieval Systems 
R. Attar, A. S. Fraenkel 

July 1977 Journal of the ACM (JACM), volume 24 issue 3 

Full text available: pdf(136 MB) Additional Information: full citation, references, citings, index terms
33 New Techniques for Disambiguation in Natural Language and Their Application to
Biological Text 

Filip Ginter, Jorma Boberg, Jouni Jarvinen, Tapio Salakoski 

August 2004 The Journal of Machine Learning Research, volume 5 

Full text available: pdf(208.77 KB) Additional Information: full citation, abstract, index terms

We study the problems of disambiguation in natural language, focusing on the problem of 
gene vs. protein name disambiguation in biological text and also considering the problem of 
context-sensitive spelling error correction. We introduce a new family of classifiers based on 
ordering and weighting the feature vectors obtained from word counts and word
co-occurrence in the text, and inspect several concrete classifiers from this family. We obtain
the most accurate prediction when weighting by posi ... 

34 Special issue on word sense disambiguation: Automatic word sense discrimination 
Hinrich Schutze 

March 1998 Computational Linguistics, volume 24 issue 1

Full text available: pdf Additional Information: full citation, abstract, references, citings
Publisher Site

This paper presents context-group discrimination, a disambiguation algorithm based on 
clustering. Senses are interpreted as groups (or clusters) of similar contexts of the 
ambiguous word. Words, contexts, and senses are represented in Word Space, a high- 
dimensional, real-valued space in which closeness corresponds to semantic similarity. 
Similarity in Word Space is based on second-order co-occurrence: two tokens (or contexts) 
of the ambiguous word are assigned to the same sense cluster if the wo ... 

35 An automatic hierarchical image classification scheme 
Jing Huang, S. Ravi Kumar, Ramin Zabih 

September 1998 Proceedings of the sixth ACM international conference on Multimedia 

Full text available: pdf(855.42 KB) Additional Information: full citation, references, citings, index terms



36 Information extraction: Unsupervised learning of soft patterns for generating definitions
from online news 

Hang Cui, Min-Yen Kan, Tat-Seng Chua 

May 2004 Proceedings of the 13th international conference on World Wide Web 

Full text available: pdf(165.57 KB) Additional Information: full citation, abstract, references, index terms

Breaking news often contains timely definitions and descriptions of current terms, 
organizations and personalities. We utilize such web sources to construct definitions for such 
terms. Previous work has identified definitions using hand-crafted rules or supervised
learning that constructs rigid, hard text patterns. In contrast, we demonstrate a new 
approach that uses flexible, soft matching patterns to characterize definition sentences. Our 
soft patterns are able to effectively accommodate the d ... 

Keywords: definition generation, definitional question answering, pseudo-relevance 
feedback, soft patterns, unsupervised learning



37 Industry track papers: Learning domain-independent string transformation weights for
high accuracy object identification 
Sheila Tejada, Craig A. Knoblock, Steven Minton 

July 2002 Proceedings of the eighth ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(1.12 MB) Additional Information: full citation, abstract, references, citings, index terms

The task of object identification occurs when integrating information from multiple websites.
The same data objects can exist in inconsistent text formats across sites, making it difficult
to identify matching objects using exact text match. Previous methods of object
identification have required manual construction of domain-specific string transformations or
manual setting of general transformation parameter weights for recognizing format 
inconsistencies. This manual process can be time consum ... 

38 Semantic web services: Meteor-s web service annotation framework
Abhijit A. Patil, Swapna A. Oundhakar, Amit P. Sheth, Kunal Verma 

May 2004 Proceedings of the 13th international conference on World Wide Web 

Full text available: pdf Additional Information: full citation, abstract, references, index terms

The World Wide Web is emerging not only as an infrastructure for data, but also for a
broader variety of resources that are increasingly being made available as Web services. 
Relevant current standards like UDDI, WSDL, and SOAP are in their fledgling years and form 
the basis of making Web services a workable and broadly adopted technology. However, 
realizing the fuller scope of the promise of Web services and associated service oriented 
architecture will require further technological advances in ...

Keywords: ontology, semantic annotation of web services, semantic web services, web 
services discovery, wsdl 

39 Document analysis: Visual signature based identification of Low-resolution document
images 

Ardhendu Behera, Denis Lalanne, Rolf Ingold 

October 2004 Proceedings of the 2004 ACM symposium on Document engineering 

Full text available: pdf(2.00 MB) Additional Information: full citation, abstract, references, index terms

In this paper, we present (a) a method for identifying documents captured from
low-resolution devices such as web-cams, digital cameras or mobile phones and (b) a technique
for extracting their textual content without performing OCR. The first method associates a 
hierarchically structured visual signature to the low-resolution document image and further 
matches it with the visual signatures of the original high-resolution document images,
stored in PDF form in a repository. The matching algor ...

Keywords: document visual signature, document-based meeting retrieval, documents' 
content extraction, low-resolution document image identification 

40 Student best paper contest: Confidence-based dynamic ensemble for image
annotation and semantics discovery 

Beitao Li, Kingshy Goh

November 2003 Proceedings of the eleventh ACM international conference on 
Multimedia 

Full text available: pdf(275.37 KB) Additional Information: full citation, abstract, references, index terms

Providing accurate and scalable solutions to map low-level perceptual features to high-level 
semantics is critical for multimedia information organization and retrieval. In this paper, we 
propose a confidence-based dynamic ensemble (CDE) to overcome the shortcomings of the 
traditional static classifiers. In contrast to the traditional models, CDE can make dynamic 
adjustments to accommodate new semantics, to assist the discovery of useful low-level 
features, and to improve class-prediction ... 
41 Poster papers: Incremental context mining for adaptive document classification
Rey-Long Liu, Yun-Ling Lu 

July 2002 Proceedings of the eighth ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(641.82 KB) Additional Information: full citation, abstract, references, index terms

Automatic document classification (DC) is essential for the management of information and 
knowledge. This paper explores two practical issues in DC: (1) each document has its 
context of discussion, and (2) both the content and vocabulary of the document database is 
intrinsically evolving. The issues call for adaptive document classification (ADC) that adapts 
a DC system to the evolving contextual requirement of each document category, so that 
input documents may be classifie ... 



Keywords: adaptive document classification, context text mining, incremental mining 



42 Multidocument summarization: An added value to clustering in interactive retrieval

Manuel J. Mana-Lopez, Manuel De Buenaga, Jose M. Gomez-Hidalgo 

April 2004 ACM Transactions on Information Systems (TOIS), Volume 22 issue 2 

Full text available: pdf Additional Information: full citation, abstract, references, index terms

A more and more generalized problem in effective information access is the presence in the 
same corpus of multiple documents that contain similar information. Generally, users may 
be interested in locating, for a topic addressed by a group of similar documents, one or
several particular aspects. This kind of task, called instance or aspectual retrieval, has been 
explored in several TREC Interactive Tracks. In this article, we propose in addition to the 
classification capacity of clustering techn ... 

Keywords: Multidocument summarization, topic segmentation



43 The interaction of knowledge sources in word sense disambiguation
Mark Stevenson, Yorick Wilks 

September 2001 Computational Linguistics, volume 27 issue 3 

Full text available: pdf(2.16 MB) Additional Information: full citation, abstract, references
Publisher Site

Word sense disambiguation (WSD) is a computational linguistics task likely to benefit from 
the tradition of combining different knowledge sources in artificial intelligence research. An
important step in the exploration of this hypothesis is to determine which linguistic 
knowledge sources are most useful and whether their combination leads to improved 
results. We present a sense tagger which uses several knowledge sources. Tested accuracy 
exceeds 94% on our evaluation corpus. Our system attempts ... 

Biclustering Algorithms for Biological Data Analysis: A Survey 
Sara C. Madeira, Arlindo L. Oliveira 

January 2004 IEEE/ACM Transactions on Computational Biology and Bioinformatics 
(TCBB), Volume 1 Issue 1 

Full text available: pdf Additional Information: full citation 



Keywords: Biclustering, simultaneous clustering, coclustering, subspace clustering, 
bidimensional clustering, direct clustering, block clustering, two-way clustering, two-mode 
clustering, two-sided clustering, microarray data analysis, biological data analysis, gene 
expression data 



45 Towards the automatic identification of adjectival scales: clustering adjectives 
according to meaning 

Vasileios Hatzivassiloglou, Kathleen R. McKeown 

June 1993 Proceedings of the 31st conference on Association for Computational 
Linguistics 

Full text available: pdf(1.04 MB), Publisher Site 
Additional Information: full citation, abstract, references, citings 

In this paper we present a method to group adjectives according to their meaning, as a first 
step towards the automatic identification of adjectival scales. We discuss the properties of 
adjectival scales and of groups of semantically related adjectives and how they imply 
sources of linguistic knowledge in text corpora. We describe how our system exploits this 
linguistic knowledge to compute a measure of similarity between two adjectives, using 
statistical techniques and without having access to ... 

46 Artificial intelligence #1: SA_MetaMatch: relevant document discovery through 
document metadata and indexing 

Hiu S. Yau, J. Scott Hawker 

April 2004 Proceedings of the 42nd annual Southeast regional conference 

Full text available: pdf(429.27 KB) Additional Information: full citation, abstract, references, index terms 

SA_MetaMatch, a component of the Standards Advisor (SA), is designed to find relevant 
documents through matching indices of metadata and document content. The elements in 
the metadata schema are mainly adopted from the Dublin Core (DC). The implementation of 
the XML metadata schema and coding follows the DC recommended guidelines. After 
metadata is generated manually for an unstructured document, or is extracted automatically 
from documents of well defined layout, they are stored in metadata fil ... 

Keywords: Dublin Core, document matching, index, metadata 



47 Technical session 5: student best paper contest: LyricAlly: automatic synchronization 
of acoustic musical signals and textual lyrics 
Ye Wang, Min-Yen Kan, Tin Lay Nwe, Arun Shenoy, Jun Yin 

October 2004 Proceedings of the 12th annual ACM international conference on 
Multimedia 

Full text available: pdf(485.10 KB) Additional Information: full citation, abstract, references, index terms 

We present a prototype that automatically aligns acoustic musical signals with their 
corresponding textual lyrics, in a manner similar to manually-aligned karaoke. We tackle this 
problem using a multimodal approach, where the appropriate pairing of audio and text 
processing helps create a more accurate system. Our audio processing technique uses a 
combination of top-down and bottom-up approaches, combining the strength of low-level 
audio features and high-level musical knowledge to determine ... 

Keywords: audio/text synergy, karaoke, lyric alignment, music knowledge, vocal detection 

48 Web mining and clustering: Stylistic and lexical co-training for web block classification 
Chee How Lee, Min-Yen Kan, Sandra Lai 

November 2004 Proceedings of the 6th annual ACM international workshop on Web 
information and data management 

Full text available: pdf(349.91 KB) Additional Information: full citation, abstract, references, index terms 

Many applications which use web data extract information from a limited number of regions 
on a web page. As such, web page division into blocks and the subsequent block 
classification have become a preprocessing step. We introduce PARCELS, an open-source, 
co-trained approach that performs classification based on separate stylistic and lexical views 
of the web page. Unlike previous work, PARCELS performs classification on fine-grained 
blocks. In addition to table-based layout, the system handl ... 

Keywords: PARCELS, co-training, lexical and stylistic learners, web page block 
classification, web page division 



49 Image Categorization by Learning and Reasoning with Regions 
Yixin Chen, James Z. Wang 

August 2004 The Journal of Machine Learning Research, volume 5 
Full text available: pdf Additional Information: full citation, abstract 




Designing computer programs to automatically categorize images using low-level features is 
a challenging research topic in computer vision. In this paper, we present a new learning 
technique, which extends Multiple-Instance Learning (MIL), and its application to the 
problem of region-based image categorization. Images are viewed as bags, each of which 
contains a number of instances corresponding to regions obtained from image 
segmentation. The standard MIL problem assumes that a bag is labeled p ... 

50 Item-based top-N recommendation algorithms 
Mukund Deshpande, George Karypis 

January 2004 ACM Transactions on Information Systems (TOIS), Volume 22 issue 1 
Full text available: pdf(240.61 KB) Additional Information: full citation, abstract, references, index terms 



The explosive growth of the world-wide-web and the emergence of e-commerce have led to 
the development of recommender systems— a personalized information filtering technology 
used to identify a set of items that will be of interest to a certain user. User-based 
collaborative filtering is the most successful technology for building recommender systems 
to date and is extensively used in many commercial recommender systems. Unfortunately, 
the computational complexity of these methods grows I ... 

Keywords: e-commerce, predicting user behavior, world wide web 



51 Sources of Success for Boosted Wrapper Induction 
David Kauchak, Joseph Smarr, Charles Elkan 

August 2004 The Journal of Machine Learning Research, volume 5 

Full text available: pdf(281.46 KB) Additional Information: full citation, abstract, index terms 

In this paper, we examine an important recent rule-based information extraction (IE) 
technique named Boosted Wrapper Induction (BWI) by conducting experiments on a wider 
variety of tasks than previously studied, including tasks using several collections of natural 
text documents. We investigate systematically how each algorithmic component of BWI, in 
particular boosting, contributes to its success. We show that the benefit of boosting arises 
from the ability to reweight examples to learn specifl ... 
52 Distributional clustering of English words 
Fernando Pereira, Naftali Tishby, Lillian Lee 

June 1993 Proceedings of the 31st conference on Association for Computational 
Linguistics 

Full text available: pdf(756.61 KB), Publisher Site 
Additional Information: full citation, abstract, references, citings 

We describe and evaluate experimentally a method for clustering words according to their 
distribution in particular syntactic contexts. Words are represented by the relative frequency 
distributions of contexts in which they appear, and relative entropy between those 
distributions is used as the similarity measure for clustering. Clusters are represented by 
average context distributions derived from the given words according to their probabilities of 
cluster membership. In many cases, the cluster ... 

53 The automatic construction of large-scale corpora for summarization research 
Daniel Marcu 

August 1999 Proceedings of the 22nd annual international ACM SIGIR conference on 
Research and development in information retrieval 

Full text available: pdf(122.54 KB) Additional Information: full citation, references, citings, index terms 
August 1999 Proceedings of the fourth ACM conference on Digital libraries 

Full text available: pdf(256.74 KB) Additional Information: full citation, references, index terms 



Keywords: MEDLINE, MEDSPACE, concept space, interspace, medical informatics, scalable 
semantics, semantic indexing, semantic retrieval 



55 Topic detection and tracking in English and Chinese 
Charles L Wayne 

November 2000 Proceedings of the fifth international workshop on Information 
retrieval with Asian languages 

Full text available: pdf(688.09 KB) Additional Information: full citation, abstract, references 

Topic Detection and Tracking (TDT) refers to automatic techniques for discovering, 
threading, and retrieving topically related material in streams of data. Newswire and 
broadcast news are the canonical sources. In 1999, TDT research was extended from 
English to Chinese, and carefully annotated multilingual corpora were created. Researchers 
devised clever approaches to the cross-language challenge, and formal performance 
evaluations yielded very promising results. This paper outlines the 1999 ... 

Keywords: detection, segmentation, speech, text, topic, tracking 



56 Research track: SEWeP: using site semantics and a taxonomy to enhance the Web 
personalization process 
M. Eirinaki, M. Vazirgiannis, I. Varlamis 

August 2003 Proceedings of the ninth ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(429.65 KB) Additional Information: full citation, abstract, references, index terms 

Web personalization is the process of customizing a Web site to the needs of each specific 
user or set of users, taking advantage of the knowledge acquired through the analysis of the 
user's navigational behavior. Integrating usage data with content, structure or user profile 
data enhances the results of the personalization process. In this paper, we present SEWeP, 
a system that makes use of both the usage logs and the semantics of a Web site's content 
in order to personalize it. Web content is ... 

Keywords: Web mining, Web personalization, concept hierarchies, semantic annotation of 
Web content 
58 Word association norms, mutual information, and lexicography 
Kenneth Ward Church, Patrick Hanks 

March 1990 Computational Linguistics, volume 16 issue 1 

Full text available: pdf(856.69 KB), Publisher Site 
Additional Information: full citation, abstract, references, citings 

The term word association is used in a very particular sense in the psycholinguistic 
literature. (Generally speaking, subjects respond quicker than normal to the word nurse if it 
follows a highly associated word such as doctor.) We will extend the term to provide the 
basis for a statistical description of a variety of interesting linguistic phenomena, ranging 
from semantic relations of the doctor/nurse type (content word/content word) to lexico- 
syntactic co-occurrence const ... 

59 The SMART lab report 

Mike Lesk, Donna Harman, Edward A. Fox, Harry Wu, Chris Buckley 
April 1997 ACM SIGIR Forum, Volume 31 issue 1 

Full text available: pdf(1.65 MB) Additional Information: full citation, index terms 

60 Industry track papers: On the potential of domain literature for clustering and Bayesian 
network learning 

Peter Antal, Patrick Glenisson, Geert Fannes 

July 2002 Proceedings of the eighth ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(1.10 MB) Additional Information: full citation, abstract, references, index terms 

Thanks to its increasing availability, electronic literature can now be a major source of 
information when developing complex statistical models where data is scarce or contains 
much noise. This raises the question of how to integrate information from domain literature 
with statistical data. Because quantifying similarities or dependencies between variables is a 
basic building block in knowledge discovery, we consider here the following question. Which 
vector representations of text and which st ... 

Keywords: Bayesian networks, clustering, data mining, text mining 